ENVX1002 Introduction to Statistical Methods
The University of Sydney
Dec 2024
Last lecture…
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
Ideal for predicting a continuous response variable from a single predictor variable: “How does y change as x changes, when the relationship is linear?”
Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_k x_{ki} + \epsilon_i
“How does y change as x_1, x_2, …, x_k change?”
Y_i = f(x_i, \beta) + \epsilon_i
where f(x_i, \beta) is a nonlinear function of the parameters \beta: “How do we model a change in y with x when the relationship is nonlinear?”
Carl Friedrich Gauss (1777-1855) and Isaac Newton (1642-1726).
Linear relationships are simple to interpret since the rate of change is constant.
“As one changes, the other changes at a constant rate.”
Nonlinear relationships often involve exponential, logarithmic, or power functions.
“As one changes, the other changes at a rate that is not proportional to the change in the other.”
In a power relationship where y is a function of x^a, as x increases, y changes at a rate proportional to x^{a-1} (the derivative of x^a is a x^{a-1}), so the rate of change itself depends on x.
Interpretation:
| | Exponents | Logarithms |
|---|---|---|
| Definition | If a^n = b, a is the base, n is the exponent, and b is the result. | If \log_a b = n, a is the base, b is the result, and n is the logarithm (or the exponent in the equivalent exponential form). |
| Example | 2^3 = 8 | \log_2 8 = 3 |
| Interpretation | 2 raised to the power of 3 equals 8. | The power to which you must raise 2 to get 8 is 3. |
| Inverse | The logarithm is the inverse operation of exponentiation. | The exponentiation is the inverse operation of logarithm. |
| Properties | (a^n)^m = a^{n \cdot m}, a^n \cdot a^m = a^{n+m}, \frac{a^n}{a^m} = a^{n-m} | \log_a(b \cdot c) = \log_a b + \log_a c, \log_a\left(\frac{b}{c}\right) = \log_a b - \log_a c, \log_a(b^n) = n \cdot \log_a b |
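These identities are easy to verify numerically. A quick check in R (the specific numbers are arbitrary, chosen only for illustration):

```r
# Exponent and logarithm are inverse operations: 2^3 = 8 and log2(8) = 3
stopifnot(2^3 == 8)
stopifnot(all.equal(log2(8), 3))

# Product, quotient and power rules for logarithms
stopifnot(all.equal(log(6),   log(2) + log(3)))  # log(b*c)   = log b + log c
stopifnot(all.equal(log(2/3), log(2) - log(3)))  # log(b/c)   = log b - log c
stopifnot(all.equal(log(2^5), 5 * log(2)))       # log(b^n)   = n * log b
```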
Often, a nonlinear relationship may be transformed into a linear relationship by applying a transformation to the response variable or the predictor variable(s).
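For example, an exponential decay y = a e^{-bx} becomes linear after taking logs: log(y) = log(a) - bx. A minimal sketch with simulated data (variable names and parameter values are illustrative):

```r
set.seed(1)
x <- seq(0.5, 10, length.out = 50)
# Exponential decay with multiplicative noise: y = 5 * exp(-0.4 * x)
y <- 5 * exp(-0.4 * x) * exp(rnorm(50, sd = 0.05))

# Log-transform the response, then fit an ordinary linear model
fit <- lm(log(y) ~ x)
coef(fit)  # intercept should be near log(5), slope near -0.4
```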
f(x_i, \beta)
Response variable decreases and approaches a limit as the predictor variable increases.
y = a \cdot e^{-bx}
Examples: radioactive decay, population decline, chemical reactions.
Response variable increases and approaches a limit as the predictor variable increases.
y = a + b(1 - e^{-cx})
Examples: population growth, enzyme kinetics.
An S-shaped relationship, where the response variable at first grows roughly exponentially, then levels off toward an asymptote.
y = c + \frac{d-c}{1+e^{-b(x-a)}}
Examples: growth of bacteria, disease spread, species growth.
Response variable changes in a variety of ways as the predictor variable changes.
y = a + bx + cx^2 + dx^3 + ...
Examples: food intake, drug dosage, exercise.
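The four families above can be written as small R functions (the parameter values used here are arbitrary, chosen only to show the shapes):

```r
# The four common nonlinear forms from the slides
exp_decay  <- function(x, a, b)       a * exp(-b * x)
asymptotic <- function(x, a, b, c)    a + b * (1 - exp(-c * x))
logistic   <- function(x, a, b, c, d) c + (d - c) / (1 + exp(-b * (x - a)))
cubic      <- function(x, a, b, c, d) a + b * x + c * x^2 + d * x^3

x <- seq(0, 10, by = 0.1)
# e.g. plot(x, logistic(x, a = 5, b = 1, c = 0, d = 100), type = "l")
```

At x = a the logistic curve is exactly halfway between its lower limit c and upper limit d.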
How far can we go?
“You need to know a bit of calculus.”
In R, we can use the nls() function to fit the following nonlinear models:
A special case of multiple linear regression used to model nonlinear relationships.
Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + ... + \beta_k x_i^k + \epsilon_i
where k is the degree of the polynomial.
Polynomial models are linear in the parameters, so they can be fitted with lm(). Using the asymptotic data:
See Slide 11 for the relationship and mathematical expression.
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i
Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i
Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + ... + \beta_{10} x_i^{10} + \epsilon_i
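A sketch of fitting these four models with lm() and poly(). The asymptotic data frame here is simulated as a stand-in for the lecture's dataset (same column names, so the call matches the output below):

```r
# Simulated stand-in for the lecture's asymptotic dataset
set.seed(42)
asymptotic <- data.frame(predictor = seq(0, 8, length.out = 51))
asymptotic$response <- -15 + 113 * (1 - exp(-0.63 * asymptotic$predictor)) +
  rnorm(51, sd = 10)

fit1  <- lm(response ~ predictor,           data = asymptotic)
fit2  <- lm(response ~ poly(predictor, 2),  data = asymptotic)
fit3  <- lm(response ~ poly(predictor, 3),  data = asymptotic)
fit10 <- lm(response ~ poly(predictor, 10), data = asymptotic)

anova(fit1, fit2, fit3, fit10)  # compare the nested fits
```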
Call:
lm(formula = response ~ poly(predictor, 10), data = asymptotic)
Residuals:
Min 1Q Median 3Q Max
-17.1659 -8.6908 -0.0494 8.8003 16.4012
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.818 1.552 51.426 < 2e-16 ***
poly(predictor, 10)1 159.368 11.084 14.378 < 2e-16 ***
poly(predictor, 10)2 -106.939 11.084 -9.648 5.37e-12 ***
poly(predictor, 10)3 48.570 11.084 4.382 8.28e-05 ***
poly(predictor, 10)4 -19.411 11.084 -1.751 0.0876 .
poly(predictor, 10)5 1.193 11.084 0.108 0.9148
poly(predictor, 10)6 -2.769 11.084 -0.250 0.8040
poly(predictor, 10)7 -1.343 11.084 -0.121 0.9042
poly(predictor, 10)8 -4.009 11.084 -0.362 0.7195
poly(predictor, 10)9 -2.851 11.084 -0.257 0.7984
poly(predictor, 10)10 5.769 11.084 0.520 0.6056
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.08 on 40 degrees of freedom
Multiple R-squared: 0.8897, Adjusted R-squared: 0.8621
F-statistic: 32.26 on 10 and 40 DF, p-value: 4.846e-16
Beyond lm(): if you have some understanding of the underlying relationship (e.g. a mechanistic process) between the variables, you can fit a nonlinear model directly.
Y_i = f(x_i, \beta) + \epsilon_i
where f(x_i, \beta) is a nonlinear function of the parameters \beta.
Like the linear model, the nonlinear model assumes:
Basically:
\epsilon_i \sim N(0, \sigma^2)
Like all other models we have seen, we focus on the residuals to assess the model fit, since the residuals are the only part of the model that is random.
Source: Wikipedia
We use the nls() function in R.
The nls() function takes three main arguments:

- formula: a formula object, with the response variable on the left of a ~ operator and the predictor variable(s) on the right.
- data: a data frame containing the variables in the model.
- start: a named list of starting values for the parameters in the model.
y = a + b(1 - e^{-cx})
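Putting the three arguments together (the output below comes from the lecture's dataset; this sketch uses simulated data standing in for it, with starting values that are rough guesses read off a scatterplot):

```r
# Simulated stand-in for the lecture's asymptotic dataset
set.seed(42)
asymptotic <- data.frame(predictor = seq(0, 8, length.out = 51))
asymptotic$response <- -15 + 113 * (1 - exp(-0.63 * asymptotic$predictor)) +
  rnorm(51, sd = 10)

fit_nls <- nls(response ~ a + b * (1 - exp(-c * predictor)),
               data  = asymptotic,
               start = list(a = 0, b = 100, c = 1))
summary(fit_nls)
```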
Formula: response ~ a + b * (1 - exp(-c * predictor))
Parameters:
Estimate Std. Error t value Pr(>|t|)
a -14.51755 6.64162 -2.186 0.0337 *
b 113.03797 6.44716 17.533 < 2e-16 ***
c 0.62962 0.07142 8.816 1.32e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.21 on 48 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 1.178e-06
y = -14.5 + 113.04(1 - e^{-0.63x})

The R-squared value is not reported for nonlinear models, as the total sum of squares is not partitioned into explained and unexplained components.
y = c + \frac{d-c}{1+e^{-b(x-a)}}
nls() function requires a formula and starting point(s) for the parameters.
R provides self-starting functions such as SSasymp(), SSlogis(), SSmicmen(), and SSweibull().

Important:
We still need to have some understanding of the underlying relationship between the variables to pick the right function.
y = c + \frac{d-c}{1+e^{-b(x-a)}}
- input: the predictor variable.
- Asym: the upper limit (asymptote).
- xmid: the value of x when y is halfway between the lower and upper limits.
- scal: the rate of change.

The equation is different: see ?SSlogis:
y = \frac{Asym}{1+\exp\left(\frac{xmid-input}{scal}\right)}
Other than input, the other parameters can be left to the function to estimate.
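With a self-starting function, no start list is needed. A sketch on simulated logistic data (the parameter values used to simulate are illustrative, loosely matching the output below):

```r
# Simulated logistic data: Asym = 310, xmid = 4.9, scal = 1.35
set.seed(1)
logistic_df <- data.frame(predictor = seq(0, 10, length.out = 51))
logistic_df$response <-
  310 / (1 + exp((4.9 - logistic_df$predictor) / 1.35)) + rnorm(51, sd = 10)

# SSlogis supplies its own starting values, so no start argument is needed
fit_ss <- nls(response ~ SSlogis(predictor, Asym, xmid, scal),
              data = logistic_df)
summary(fit_ss)
```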
Formula: response ~ SSlogis(predictor, Asym, xmid, scal)
Parameters:
Estimate Std. Error t value Pr(>|t|)
Asym 310.64727 4.62579 67.16 <2e-16 ***
xmid 4.92715 0.07142 68.99 <2e-16 ***
scal 1.34877 0.05418 24.90 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.22 on 48 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 6.626e-07
The self-starting function for the asymptotic model is SSasymp().
Comparing outputs:
In some cases, the fits are identical, but in others, they are not.
Note: this is non-examinable content but might be useful for your project.
We can use prediction quality metrics to compare the fits.
Use the broom package to extract the AIC and BIC values from the model fits.
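A sketch of how such a table could be assembled with broom::glance(), assuming the four polynomial fits exist; the data here are simulated as a stand-in for the lecture's dataset, so the AIC/BIC values will differ from the output below:

```r
library(broom)

# Simulated stand-in for the lecture's asymptotic dataset
set.seed(42)
asymptotic <- data.frame(predictor = seq(0, 8, length.out = 51))
asymptotic$response <- -15 + 113 * (1 - exp(-0.63 * asymptotic$predictor)) +
  rnorm(51, sd = 10)

fits <- list(
  linear = lm(response ~ predictor,           data = asymptotic),
  poly2  = lm(response ~ poly(predictor, 2),  data = asymptotic),
  poly3  = lm(response ~ poly(predictor, 3),  data = asymptotic),
  poly10 = lm(response ~ poly(predictor, 10), data = asymptotic)
)

# glance() returns one row of fit statistics per model, including AIC and BIC
tab <- do.call(rbind, lapply(fits, glance))
tab$model <- names(fits)
tab[, c("model", "AIC", "BIC")]
```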
# A tibble: 4 × 3
model AIC BIC
<chr> <dbl> <dbl>
1 linear 453. 459.
2 poly2 409. 416.
3 poly3 392. 402.
4 poly10 402. 425.
Looks like a lot of work, but it’s easy in R using the caret package.
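A sketch of cross-validated RMSE and MAE for one of the models using caret::train(); the data are simulated as a stand-in for the lecture's dataset, so the numbers will differ from the table below:

```r
library(caret)

# Simulated stand-in for the lecture's asymptotic dataset
set.seed(42)
asymptotic <- data.frame(predictor = seq(0, 8, length.out = 51))
asymptotic$response <- -15 + 113 * (1 - exp(-0.63 * asymptotic$predictor)) +
  rnorm(51, sd = 10)

# 10-fold cross-validation; caret reports RMSE, R-squared and MAE
ctrl <- trainControl(method = "cv", number = 10)
fit_cv <- train(response ~ poly(predictor, 3), data = asymptotic,
                method = "lm", trControl = ctrl)
fit_cv$results[, c("RMSE", "MAE")]
```

Repeating the call for each candidate model gives the comparison table below.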
| | MAE | RMSE |
|---|---|---|
| linear | 15.751679 | 19.76680 |
| quadratic | 9.881451 | 11.93538 |
| cubic | 9.741090 | 11.27664 |
| poly10 | 11.774449 | 13.70916 |
From the results, the cubic model has the lowest RMSE and MAE, so it is the best of the four models.
This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.